Abstract

In this paper, we propose and study random maxout features, which are constructed
by first projecting the input data onto sets of randomly generated vectors
with Gaussian elements, and then outputting the maximum projection value for
each set. We show that the resulting random feature map, when used in conjunction
with linear models, allows for the locally linear estimation of the function
of interest in classification tasks, and for the locally linear embedding of points
when used for dimensionality reduction or data visualization. We derive generalization
bounds for learning that assess the error in approximating locally linear
functions by linear functions in the maxout feature space, and empirically evaluate
the efficacy of the approach on the MNIST and TIMIT classification tasks.


1 Introduction 

Kernel-based learning algorithms are ubiquitous in both supervised and unsupervised learning. For
example, a universal kernel support vector machine approximates, to arbitrary precision, any
non-linear decision boundary function given enough training points [?]. On the other hand, methods like
Kernel Principal Component Analysis (Kernel PCA) capture non-linear relationships between
variables of interest, and are used in non-linear dimensionality reduction. However, non-linear
kernel methods suffer from high computational complexity (often cubic in the sample size) and are
difficult to parallelize: training and testing on even a modestly sized dataset such as the TIMIT
speech corpus (2M training samples) can be very challenging. Linear methods (linear support vector
machines, logistic regression, ridge regression, Principal Component Analysis, etc.), on the other
hand, suffer from low capacity and representation power, but have computational complexity
linear in the sample size, and so can be more readily scaled to large data corpora. Scalability and
non-linear representation power are two desiderata of any learning algorithm. Deep Neural
Networks owe their success to having both properties: they allow rich, non-linear modeling, and they
scale linearly in the sample size when trained with variants of stochastic gradient descent [?].

Kernel Approximation with Random Features. An elegant approach to overcoming the computational
load of kernel methods, pioneered by [?], consists of generating explicit, randomized feature
maps $\Phi : \mathcal{X} \subset \mathbb{R}^d \to \mathbb{R}^m$, where $m$ is typically larger than the dimension $d$ of the input space, to
approximate the kernel $K$:

$$\text{for } x, z \in \mathcal{X}, \quad K(x, z) \approx \langle \Phi(x), \Phi(z) \rangle. \qquad (1)$$

When used in conjunction with linear methods, such randomized features reveal the non-linear
structure in the data, and we gain scalability, as linear methods scale linearly with the training sample size.
Random Fourier features, introduced in [?], approximate shift-invariant kernels. For example, the
Gaussian kernel $K(x, z) = \exp\left(-\|x - z\|_2^2 / (2\sigma^2)\right)$ can be approximated using the following feature
map:

$$\Phi(x) = \sqrt{\tfrac{2}{m}} \left( \cos(w_1^\top x + b_1), \dots, \cos(w_m^\top x + b_m) \right),$$

where $w_i \sim \mathcal{N}(0, \frac{1}{\sigma^2} I_d)$ are independent Gaussian vectors, and the $b_i$ are independently and uniformly
drawn from $[0, 2\pi]$. Recently, [5] showed that a highly oversampled random Fourier feature map
($m$ = 400K), combined with a large-scale linear least squares classifier, approaches the performance of dense
deep neural networks trained on the TIMIT speech corpus.
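
The random Fourier construction can be illustrated with a minimal sketch. It assumes the standard Gaussian kernel with bandwidth $\sigma$ and the usual $\sqrt{2/m}$ scaling of the cosine features; the function names (`draw_rff_params`, `rff_map`) and the toy inputs are illustrative choices, not taken from the paper.

```python
import math
import random


def draw_rff_params(d, m, sigma, rng):
    """Draw w_i ~ N(0, I_d / sigma^2) and b_i ~ Uniform[0, 2*pi]."""
    ws = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(d)] for _ in range(m)]
    bs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(m)]
    return ws, bs


def rff_map(x, ws, bs):
    """Random Fourier features with the usual sqrt(2/m) scaling."""
    scale = math.sqrt(2.0 / len(ws))
    return [scale * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)
            for w, b in zip(ws, bs)]


def gaussian_kernel(x, z, sigma):
    """Exact Gaussian kernel exp(-||x - z||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


rng = random.Random(0)
x, z = [0.3, -0.2, 0.5], [0.1, 0.4, 0.2]
ws, bs = draw_rff_params(d=3, m=5000, sigma=1.0, rng=rng)
approx = sum(a * b for a, b in zip(rff_map(x, ws, bs), rff_map(z, ws, bs)))
exact = gaussian_kernel(x, z, 1.0)
print(f"random-feature estimate: {approx:.3f}, exact kernel: {exact:.3f}")
```

The dot product of the two feature vectors is a Monte Carlo average over the $m$ random directions, so its error shrinks roughly like $1/\sqrt{m}$.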

Learning with Random Features. More formally, in a classical supervised learning setting, let
$\mathcal{X} \subset \mathbb{R}^d$ be the input space and $\mathcal{Y} = \{-1, 1\}$ be the label space in a binary classification setting.
We are given a training set $S = \{(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}, i = 1 \dots N\}$. For kernel methods, the goal is
to find a non-linear function $f$ mapping $\mathcal{X}$ to $\mathcal{Y}$, given a certain measure of discrepancy or loss
function $V$. The function $f$ is restricted to belong to a hypothesis class of functions $\mathcal{H}_K$, the
so-called reproducing kernel Hilbert space (RKHS). Empirical risk minimization in that setup leads to
a rich class of non-linear algorithms, via regularization in the RKHS [?],

$$\min_{f \in \mathcal{H}_K} \frac{1}{N} \sum_{i=1}^{N} V(y_i f(x_i)) + \lambda \|f\|_{\mathcal{H}_K}^2, \qquad (2)$$

where $\lambda > 0$ is the regularization parameter and $\|f\|_{\mathcal{H}_K}$ is the norm in the RKHS. The optimum $f^*$
of the problem in (2) has the following form: $f^*(x) = \sum_{i=1}^{N} \beta_i^* K(x, x_i)$. Solving for $\beta^*$
may have a computational complexity of $O(N^3)$ (regularized least squares) or $O(N^2)$ (support
vector machines). Using an explicit feature map $\Phi$ that approximates such a kernel in conjunction
with a linear model, it is sufficient to estimate a scalable regularized linear model with a
computational complexity linear in the number of training examples,

$$\min_{\alpha \in \mathbb{R}^m} \frac{1}{N} \sum_{i=1}^{N} V(y_i \langle \alpha, \Phi(x_i) \rangle) + \lambda \|\alpha\|_2^2, \qquad (3)$$

where $\alpha^*$ denotes the optimal solution. For sufficiently large $m$ we have $f^*(x) \approx \langle \alpha^*, \Phi(x) \rangle$, and $\alpha^*$ can
be found in $O(Nm)$ time using stochastic gradient descent. Similar ideas extend to the unsupervised
case. Recently, [?] introduced randomized non-linear component analysis, where it is shown that
Kernel PCA can be approximated by using random Fourier features followed by a linear PCA.

Contributions. In this paper, inspired by the recently introduced maxout network [?], we introduce
a simple but effective non-linear random feature map, called random maxout features, that
approximates functions of interest with piecewise-linear functions. Locally linear boundaries and
components are interesting as they carry locally the linear structure in the data, and have the
advantage of being interpretable in the original feature space of the data. But how should such a kernel
method be formulated? In principle, such a mapping could be achieved via any locally linear
kernel of the form $K(x, z) = \langle x, z \rangle k_\sigma(x, z)$, where $k_\sigma$ is a localizing kernel such as, for example, a
Gaussian kernel with bandwidth $\sigma$. However, how to efficiently realize such a conditionally linear
kernel is not clear; for example, achieving this via random Fourier features would involve taking the
Kronecker product of a linear feature map and the random features. The main contribution of this
paper is to introduce and analyze the random maxout feature map, which has the advantage that it
can be learned in $O(Nm)$ time ($N$ training points, $m$ random features) and utilized at test time in
$O(dm)$ time (assuming $d$ input features), while avoiding Kronecker products. When used
in conjunction with linear methods, random maxout features realize a scalable, local linear function
estimator for large-scale classification and regression. In the unsupervised setting, similarly to [?], random
maxout features followed by a PCA allow for a locally linear embedding of the data, which can be
used for non-linear dimensionality reduction and for data visualization. The paper is organized
as follows: in Section 2 we introduce our random maxout feature map, and show that its expected
kernel is indeed locally linear. In Section 3 we present generalization bounds for the learning of
linear functions in the random maxout feature space. In Section 4 we discuss how random maxout
features relate to previous work. Finally, in Section 5 we demonstrate the approach as a local linear
estimator in a classification setting on the MNIST and TIMIT speech corpora.

2 Random Maxout Features 

Random maxout features have the same structure as deep maxout networks in terms of maxout units.
The following definition gives a precise description of the maxout random feature map:

Definition 1 (Random Maxout Features). Let $w_j^\ell$, $\ell = 1 \dots m$, $j = 1 \dots q$, be independent
random Gaussian vectors, i.e. $w_j^\ell \sim \mathcal{N}(0, I_d)$. Denote $W^\ell = (w_1^\ell, \dots, w_q^\ell)$.
For $x \in \mathbb{R}^d$ we define a maxout random unit $h_\ell(x)$ as follows:

$$h_\ell(x) = \phi(x, W^\ell) = \max_{j = 1 \dots q} \langle w_j^\ell, x \rangle, \quad \ell = 1 \dots m.$$

A maxout random feature map $\Phi$ is therefore defined as follows:

$$\Phi(x) = \frac{1}{\sqrt{m}} \left( h_1(x), \dots, h_m(x) \right).$$

In order to study this map, we shall consider, for two points $x, z \in \mathbb{R}^d$, the dot product
$\langle \Phi(x), \Phi(z) \rangle = \frac{1}{m} \sum_{\ell=1}^{m} h_\ell(x) h_\ell(z)$. Consider first its expectation:

$$\mathbb{E}(\langle \Phi(x), \Phi(z) \rangle) = \frac{1}{m} \sum_{\ell=1}^{m} \mathbb{E}(h_\ell(x) h_\ell(z)) = \mathbb{E}(h_1(x) h_1(z)),$$

where the last equality holds because the units are independent and identically distributed. It is
therefore sufficient to study the expectation of the dot product of one unit:

$$K(x, z) = \mathbb{E}(h(x) h(z)),$$

where $h(x) = \max_{j = 1 \dots q} \langle w_j, x \rangle$, with $w_j \sim \mathcal{N}(0, I_d)$, $j = 1 \dots q$, iid.
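
As a concrete illustration of Definition 1, the following is a minimal sketch of the feature map in plain Python. The $1/\sqrt{m}$ normalization is assumed here so that the dot product of two feature vectors is an average over units; the helper names and the toy dimensions are hypothetical.

```python
import math
import random


def draw_maxout_weights(d, m, q, rng):
    """Draw m pools of q Gaussian vectors, each w_j^l ~ N(0, I_d)."""
    return [[[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(q)]
            for _ in range(m)]


def maxout_features(x, weights):
    """Phi(x) = (h_1(x), ..., h_m(x)) / sqrt(m), with h_l(x) = max_j <w_j^l, x>."""
    scale = 1.0 / math.sqrt(len(weights))
    return [scale * max(sum(wi * xi for wi, xi in zip(w, x)) for w in pool)
            for pool in weights]


rng = random.Random(0)
weights = draw_maxout_weights(d=5, m=200, q=4, rng=rng)
x = [1.0, 0.0, 0.0, 0.0, 0.0]
phi_x = maxout_features(x, weights)
phi_2x = maxout_features([2.0 * v for v in x], weights)
print(len(phi_x), phi_x[:3])
```

Each unit is a maximum of linear functions, so the map is positively homogeneous: scaling the input by a positive constant scales every feature by the same constant, as the comparison of `phi_x` and `phi_2x` shows.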

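Theorem 1 (Maxout Expected Kernel). Let $x, z \in \mathbb{R}^d$. The expected kernel of maxout random units
is given by the following expression:

$$K(x, z) = \mathbb{E}(h(x) h(z)) = \sigma^2(q) \langle x, z \rangle K_q(x, z),$$

where $\sigma^2(q) = \mathbb{E}\big[ (\max_{j = 1 \dots q} g_j)^2 \big]$, $g_j \sim \mathcal{N}(0, 1)$ iid, and $K_q(x, z)$ is a non-linear kernel given by:

$$K_q(x, z) = \mathbb{P}\Big( \arg\max_{j = 1 \dots q} \langle w_j, x \rangle = \arg\max_{j = 1 \dots q} \langle w_j, z \rangle \Big) = \sum_{i = 0}^{\infty} a_i(q) \langle x, z \rangle^i,$$

where the first three coefficients are $a_0(q) = \frac{1}{q}$, $a_1(q) = \frac{h_1^2(q)}{q - 1}$, $a_2(q) = \frac{q\, h_2^2(q)}{(q - 1)(q - 2)}$, with
$h_i(q) = \mathbb{E}(\phi_i(\max_{k = 1 \dots q} g_k))$, where $g_j$, $j = 1 \dots q$, are iid standard centered Gaussians and $\phi_i$ are the
normalized Hermite polynomials. The coefficients $a_i(q)$ are non-negative and $\sum_{i \geq 0} a_i(q) = 1$.

Proof of Theorem 1. The proof is given in Appendix A in the supplementary material. □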

2.1 Discussion of the Derived Maxout Kernel 

The expected kernel of a maxout unit is therefore a locally weighted linear kernel, and hence it
allows a non-linear estimation of functions in a piecewise linear way:

$$K(x, z) = \sigma^2(q) \langle x, z \rangle K_q(x, z),$$

where $K_q$ is a non-linear kernel. Let $\rho = \langle x, z \rangle$. In this section we discuss the locality introduced by
$K_q(\cdot, \cdot)$.

It is important to note that $0 \leq K_q(x, z) \leq 1$, since $K_q(x, z) = \mathbb{P}(D(x) = D(z))$, where $D(x) =
\arg\max_{j = 1 \dots q} \langle w_j, x \rangle$. For simplicity assume that $x, z \in \mathbb{S}^{d-1}$, where $\mathbb{S}^{d-1}$ is the unit sphere in $d$
dimensions. We start by giving values of $K_q$ in three particular cases of interest:

1. When $x$ and $z$ coincide, i.e. $x = z$ and $\rho = 1$, we have $K_q(x, z) = 1$, as $\sum_{i \geq 0} a_i(q) = 1$.

2. When $x$ and $z$ are orthogonal, i.e. $\rho = 0$, we have $K_q(x, z) = a_0(q) = \frac{1}{q}$.

3. When $x$ and $z$ are diametrically opposed, i.e. $x = -z$ and $\rho = -1$, we have $K_q(x, z) = 0$,
as $\sum_{i \geq 0} a_{2i}(q) = \sum_{i \geq 0} a_{2i+1}(q) = \frac{1}{2} \sum_{i \geq 0} a_i(q) = \frac{1}{2}$.
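
The three special cases listed above can be checked with a small Monte Carlo sketch that estimates the collision probability $\mathbb{P}(D(x) = D(z))$ directly. The pool size, the number of trials and the function names are illustrative assumptions.

```python
import random


def argmax_index(x, pool):
    """Index j maximizing <w_j, x> over the pool of Gaussian vectors."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for w in pool]
    return max(range(len(pool)), key=scores.__getitem__)


def estimate_kq(x, z, q, trials, rng):
    """Monte Carlo estimate of K_q(x, z) = P(D(x) = D(z)) over fresh pools."""
    d = len(x)
    hits = 0
    for _ in range(trials):
        pool = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(q)]
        if argmax_index(x, pool) == argmax_index(z, pool):
            hits += 1
    return hits / trials


rng = random.Random(0)
q = 4
x = [1.0, 0.0]
same = estimate_kq(x, [1.0, 0.0], q, 4000, rng)   # rho = 1
orth = estimate_kq(x, [0.0, 1.0], q, 4000, rng)   # rho = 0, expected near 1/q
opp = estimate_kq(x, [-1.0, 0.0], q, 4000, rng)   # rho = -1
print(same, orth, opp)
```

For orthogonal inputs the two projections are independent, so the two argmax indices agree with probability $1/q$; for opposed inputs the argmax of one is the argmin of the other, so they never agree when $q \geq 2$.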

In order to understand the locality introduced by the non-linear kernel $K_q$, and the relation of the
radius of the locality to the size of the pool $q$, the first-order expansion of $K_q$ gives
a hint on the effect of that kernel. In particular, the quantity $h_1(q)$ is just the expectation of the
maximum of $q$ independent Gaussians, $h_1(q) \sim \sqrt{2 \log(q)}$ [10, 26]:

$$K_q(x, z) = a_0(q) + a_1(q) \rho + O(\rho^2) = \frac{1}{q} \big( 1 + (1 + \epsilon(q))\, 2 \log(q)\, \rho \big) + O(\rho^2),$$

where $\epsilon(q) \to 0$ for $q \to \infty$.
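
The growth of $h_1(q)$ with the pool size can be examined numerically. The sketch below estimates the expected maximum of $q$ standard Gaussians by simulation and compares it with the classical upper bound $\sqrt{2 \log(q)}$; the chosen values of $q$ and the number of trials are arbitrary, and the asymptotic regime is only approached slowly.

```python
import math
import random


def expected_max(q, trials, rng):
    """Monte Carlo estimate of h_1(q) = E[max of q iid N(0, 1) variables]."""
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(0.0, 1.0) for _ in range(q))
    return total / trials


rng = random.Random(0)
estimates = {q: expected_max(q, 5000, rng) for q in (2, 16, 128)}
bounds = {q: math.sqrt(2.0 * math.log(q)) for q in estimates}
for q in estimates:
    print(f"q = {q:4d}: E[max] ~ {estimates[q]:.3f}, sqrt(2 log q) = {bounds[q]:.3f}")
```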

Denote by $g$ the function $g : [-1, 1] \to [0, 1]$ such that $K_q(x, z) = g(\rho)$. For far apart points, when
$\rho \to -1$, $g(\rho) \to 0$. The function $g$ has a linear behavior around $\rho = 0$, with a slope approximately equal to $\frac{2 \log(q)}{q}$. Note that
in this neighborhood, as $q$ increases the linear regime vanishes, and $g(\rho) \to 0$. Hence as $q$ increases
the probability of two points hashing to the same index of the maximum becomes smaller; qualitatively, the
radius of the locality shrinks as $q$ increases. Finally, for nearby points, when $\rho \to 1$, $g(\rho) \to 1$.
Qualitatively, the derived kernel satisfies $K(x, z) \approx 0$ for far apart points and $K(x, z) \approx \sigma^2(q) \langle x, z \rangle$ for
points in the same neighborhood, where the radius of the locality, and the notion of closeness, is set
by the choice of the size of the pool $q$. This radius is decreasing in $q$. Hence $K$ defines a locally
linear kernel.

Now if we go back to problem (2), and solve for $f$ in the reproducing kernel Hilbert space of the
equivalent kernel $K$ (i.e. for $\mathcal{H} = \mathcal{H}_K$), we have:

$$f^*(x) = \sum_{i=1}^{N} \beta_i^* K(x, x_i) = \sigma^2(q) \sum_{i=1}^{N} \beta_i^* \langle x, x_i \rangle K_q(x, x_i) = \sigma^2(q) \Big\langle \sum_{i=1}^{N} \beta_i^* K_q(x, x_i)\, x_i, x \Big\rangle. \qquad (4)$$

Hence we see that this derived kernel allows a locally linear estimation of the function of interest
$f^*$, where the radius of the locality is set by the choice of the size of the pool $q$.

Now consider the maxout random feature map $\Phi$ introduced in Definition 1. Recall that we have:

$$\mathbb{E}(\langle \Phi(x), \Phi(z) \rangle) = K(x, z) = \sigma^2(q) \langle x, z \rangle K_q(x, z).$$

The dot product $\langle \Phi(x), \Phi(z) \rangle$ is therefore an estimator of $K(x, z)$, i.e. for sufficiently large $m$,
$\langle \Phi(x), \Phi(z) \rangle \approx K(x, z)$. Hence we can use the feature map $\Phi$ and a simple linear model as in
equation (3), and use the optimal weight $\alpha^*$ to get an estimate of the locally linear estimation $f^*$
produced by the derived kernel as in equation (4); for sufficiently large $m$:

$$\langle \alpha^*, \Phi(x) \rangle \approx f^*(x). \qquad (5)$$

In the next section we analyze the errors incurred by such an approximation and how they translate to the
convergence of the risk to its optimal value in a dense subset of the RKHS induced by the locally
linear kernel.

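Remark 1 (Locality Sensitive Hashing). Let $C : \mathbb{R}^d \to \{1 \dots q\}^m$, such that for $x \in \mathbb{R}^d$,

$$C(x) = \Big( \arg\max_{j = 1 \dots q} \langle w_j^1, x \rangle, \dots, \arg\max_{j = 1 \dots q} \langle w_j^m, x \rangle \Big),$$

with $w_j^\ell$, $\ell = 1 \dots m$, $j = 1 \dots q$, as in Definition 1. For $x, z \in \mathbb{R}^d$ we have:

$$\mathbb{E}\Big( \frac{1}{m} \sum_{i=1}^{m} 1_{C_i(x) \neq C_i(z)} \Big) = \mathbb{P}\Big( \arg\max_{j = 1 \dots q} \langle w_j, x \rangle \neq \arg\max_{j = 1 \dots q} \langle w_j, z \rangle \Big) = 1 - K_q(x, z).$$

Hence we can approximate the local kernel $K_q(x, z)$ by means of the Hamming distance between the
$q$-ary strings $C(x)$ and $C(z)$. As $m$ becomes large we have:

$$d_H(C(x), C(z)) = \frac{1}{m} \sum_{i=1}^{m} 1_{C_i(x) \neq C_i(z)} \approx 1 - K_q(x, z).$$

Hence $C$ defines a locality sensitive hashing scheme in the sense of [?].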
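
The hashing view of Remark 1 can be sketched as follows: each point is mapped to a string of $m$ argmax indices, and the normalized Hamming distance between two strings estimates $1 - K_q$. The helper names are hypothetical, and indices run from 0 to $q - 1$ rather than 1 to $q$.

```python
import math
import random


def hash_code(x, pools):
    """C(x): for each pool, the index of the maximizing projection."""
    code = []
    for pool in pools:
        scores = [sum(wi * xi for wi, xi in zip(w, x)) for w in pool]
        code.append(max(range(len(pool)), key=scores.__getitem__))
    return code


def hamming(c1, c2):
    """Normalized Hamming distance between two q-ary strings."""
    return sum(a != b for a, b in zip(c1, c2)) / len(c1)


rng = random.Random(1)
d, m, q = 3, 4000, 4
pools = [[[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(q)]
         for _ in range(m)]
x = [1.0, 0.0, 0.0]
near = [math.cos(0.1), math.sin(0.1), 0.0]   # small angle with x
orth = [0.0, 1.0, 0.0]                       # orthogonal to x
code_x = hash_code(x, pools)
d_self = hamming(code_x, hash_code(x, pools))
d_near = hamming(code_x, hash_code(near, pools))
d_orth = hamming(code_x, hash_code(orth, pools))
print(d_self, d_near, d_orth)
```

For the orthogonal pair the distance should be close to $1 - 1/q$, while the nearby pair collides on most indices.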

3 Learning with Random Maxout Features 

We show in this section that learning a linear model in the random maxout feature space allows
for a locally linear estimation of functions in a supervised classification setting. The locally linear
kernel $K(x, z) = \sigma^2(q) \langle x, z \rangle K_q(x, z)$ defines a Reproducing Kernel Hilbert Space (RKHS). In the
following we will see how linear functions in the random maxout feature space approximate a dense
subset of this locally linear RKHS. We start by introducing some notation. We assume that we are
given a training set $S = \{(x_i, y_i), x_i \in \mathcal{M} = \mathcal{X} \cap \mathbb{S}^{d-1}, y_i \in \mathcal{Y} = \{-1, 1\}, i = 1 \dots N\}$. Our
goal is to learn a function $f : \mathcal{M} \to \mathbb{R}$ via risk minimization. Let $p_y(x)$ be the label posteriors and
assume $\mathcal{M}$ is endowed with a measure $\rho_{\mathcal{M}}$. The expected and empirical risks induced by an $L$-Lipschitz
loss function $V : \mathbb{R} \to [0, 1]$ are the following:

$$\mathcal{E}_V(f) = \int_{\mathcal{M}} \sum_{y \in \mathcal{Y}} V(y f(x))\, p_y(x)\, d\rho_{\mathcal{M}}(x), \qquad \hat{\mathcal{E}}_V(f) = \frac{1}{N} \sum_{i=1}^{N} V(y_i f(x_i)).$$
The assumptions on the points belonging to the unit sphere, and on the loss being bounded by one,
can be weakened; see Remark 2. We will use in the following a notion of intrinsic dimension for the
set $\mathcal{M}$, namely the Assouad dimension, given in the following definition:

Definition 2 ([?]). The Assouad dimension of $\mathcal{M} \subset \mathbb{R}^d$, denoted by $d_{\mathcal{M}}$, is the smallest integer
$k$ such that, for any ball $B \subset \mathbb{R}^d$, the set $B \cap \mathcal{M}$ can be covered by $2^k$ balls of half the radius of $B$.

The Assouad dimension is used as a measure of the intrinsic dimension. For example, if $\mathcal{M}$ is an $\ell_p$
ball in $\mathbb{R}^d$, then $d_{\mathcal{M}} = O(d)$. If $\mathcal{M}$ is an $r$-dimensional hyperplane in $\mathbb{R}^d$, then $d_{\mathcal{M}} = O(r)$, where
$r < d$. Moreover, if $\mathcal{M}$ is an $r$-dimensional Riemannian submanifold of $\mathbb{R}^d$ with suitably bounded
curvature, then $d_{\mathcal{M}} = O(r)$.

Let $W^\ell = (w_1^\ell, \dots, w_q^\ell)$, $\ell = 1 \dots m$, and $W = (w_1, \dots, w_q)$. Since $w_j$, $j = 1 \dots q$, are iid, the
distribution of $W$ is given by $p(W) = p(w_1) \cdots p(w_q)$, where $p(w_j)$ is the distribution of a Gaussian
vector drawn from $\mathcal{N}(0, I_d)$. Similarly to the analysis in [13], let $C > 0$; we define the
infinite-dimensional function space $\mathcal{B}$:

$$\mathcal{B} = \Big\{ f(x) = \int \alpha(W)\, \phi(x, W)\, dW, \ \sup_W \Big| \frac{\alpha(W)}{p(W)} \Big| \leq C \Big\}.$$

It is easy to see that $\mathcal{B}$ is dense in $\mathcal{H}_K$ [?]. We will approximate the set $\mathcal{B}$ with $\hat{\mathcal{B}}$ defined as
follows:

$$\hat{\mathcal{B}} = \Big\{ f(x) = \sqrt{m}\, \langle \alpha, \Phi(x) \rangle = \sum_{\ell=1}^{m} \alpha_\ell\, \phi(x, W^\ell), \ \|\alpha\|_\infty \leq \frac{C}{m} \Big\}.$$

Note that in this definition of the function space we are regularizing the infinity norm of the weight
vector; this can be replaced, in practice and in theory [11], by a classical Tikhonov regularization or
other forms of regularization.

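Theorem 2 (Learning with Random Maxout Features). Let $S = \{(x_i, y_i), x_i \in \mathcal{M} = \mathcal{X} \cap \mathbb{S}^{d-1},
y_i \in \{-1, 1\}, i = 1 \dots N\}$, let $d_{\mathcal{M}}$ be the Assouad dimension of $\mathcal{M}$, and let diam($\mathcal{M}$)
be its diameter. Let $f_N = \arg\min_{f \in \hat{\mathcal{B}}} \hat{\mathcal{E}}_V(f)$. Fix $\delta > 0$, $\epsilon \in (0, 1)$, for $m >$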

^ (fi-M log _|_ log ((7 -|- 1)^, where C' is a numerical constant, we have: 

(' + fMj)) ■ 

with probability at least 1 — 3d — on the choices of the training examples and the random 

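projections. Here $C$ is the regularization parameter in the definition of $\mathcal{B}$ and $\hat{\mathcal{B}}$, $L$ is the Lipschitz
constant of the loss function $V : \mathbb{R} \to [0, 1]$, and $\sigma^2(q) = \mathbb{E}\big[ (\max_{j = 1 \dots q} g_j)^2 \big]$, $g_j \sim \mathcal{N}(0, 1)$ iid.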

The proof of Theorem 2 is given in the supplementary material in Appendix B. The main technical
difficulty consists in bounding $\sup_{x \in \mathcal{M}} |\phi(x, W)|$, and relating this quantity to the intrinsic
dimension $d_{\mathcal{M}}$. Theorem 2 shows that for $q \geq 1$, learning a linear model in the maxout feature
space defined by the map $\Phi$ has a low expected risk and, more importantly, that this risk is not far from
the one achieved by the non-linear infinite-dimensional function class $\mathcal{B}$. Locally linear functions can
hence be estimated to arbitrary precision using linear models in the maxout feature space. The
errors decompose naturally into an estimation (statistical) error with the usual rate of $O(1/\sqrt{N})$, and
an approximation error of functions in the infinite-dimensional space $\mathcal{B}$ by functions in $\hat{\mathcal{B}}$. For a
fixed $q$, in order to achieve an approximation error $\epsilon$, the bound suggests $\epsilon = 1/\sqrt{N}$: one needs to
set the dimensionality $m$ of the feature map $\Phi$ to $O(N (d_{\mathcal{M}} \log(d) + \log(q + 1)))$, where $d_{\mathcal{M}}$ is
a measure of the intrinsic dimension of the space where the inputs live, $d_{\mathcal{M}} \leq d$. For instance, if our
data lived on an $r$-dimensional Riemannian submanifold of $\mathbb{R}^d$ ($r \ll d$), the function space $\hat{\mathcal{B}}$ for
$m = O(N (r \log(d) + \log(q)))$ achieves an approximation error $\epsilon = 1/\sqrt{N}$ of a dense subset of the
function space defined by the local linear kernel, with a radius of locality set by the choice of the
parameter $q$. As $q$ increases this radius shrinks, and the dimension of the feature map increases to
ensure more locality, but with only a logarithmic dependency on $q$. The use of the intrinsic dimension of
the input space $\mathcal{M}$, which we borrow from the compressive sensing community, in the approximation
error is appealing, as most previous bounds in random features analysis relate the number
of projections only to the training size $N$ and to spectral properties of the kernel matrix [14], [13].
Using spectral properties of the kernel, results in [14] suggest that for large $N$ the number of
features is of the order $O(N \log(N))$. While the spectral properties of the kernel carry some
geometric information about the distribution of the points, they miss some important geometric structure in
the point set $\mathcal{M}$, since they capture an intrinsic dimension of the data that can be expressed only
in terms of the sample size $N$, while the Assouad dimension gives a richer description of the structure
in the data, such as sparsity. If $\mathcal{X}$ were the set of $s$-sparse signals, the Assouad dimension
would be $d_{\mathcal{M}} = O(s \log(d))$ [12], and the number of maxout random features needed for
an approximation error of $\epsilon$ follows by substituting this value into the bound on $m$. It would be interesting
to incorporate in the bound both the spectral properties of the kernel and the intrinsic dimension to get
the best of both worlds; we leave this for future work. For $q = 1$, maxout random features reduce to
classical random projections, which approximate the linear kernel. Learning classifiers from randomly
projected data has been thoroughly studied; see [15] and references therein. Theorem 2 is not as sharp
as the results presented in [15], since the proof was not specialized to the linear projection case.

Remark 2. 1- We can relax the sphere constraint on the input set to a bounded data constraint, i.e.
$\sup_{x \in \mathcal{X}} \|x\| \leq R$, and assume a bounded loss $|V(z)| \leq B$; a minor change in the proof shows that
the right hand side of the inequality in Theorem 2 is multiplied by $RB$.

2- Note that for $\epsilon = 1/\sqrt{N}$ we have $m = O(N d_{\mathcal{M}} \log(d))$. For large $N$, assume $\mathcal{M}$ is finite with
cardinality $|\mathcal{M}| = N^a$ for small $a$; we then have $d_{\mathcal{M}} = O(\log(|\mathcal{M}|)) = O(\log(N))$ and $m =
O(N \log(N) \log(d))$, which matches, up to a log term, the results in [14].

4 Related Work 

Approximating Kernels, Random Non-Linear Embeddings. The so-called Johnson-Lindenstrauss
Lemma [?] states that a linear random feature map preserves $\ell_2$ distances in an $N$-point
subset of a Euclidean space when embedded in $O(\epsilon^{-2} \log(N))$ dimensions, with a distortion of
$1 + \epsilon$. The requirement of preserving all pairwise distances is not needed in many applications; we
need to preserve distances only in a local neighborhood of the points of interest. This observation is
at the core of locality sensitive hashing [?] and has been discussed in [?]. One needs a non-linear
random feature map in order to achieve a local embedding. Random Fourier features [?] approximating
the Gaussian kernel achieve such a goal. Random maxout features also achieve such a goal,
by performing a locally linear embedding of the points.

Scaling up Kernel Methods. As discussed earlier, random features, pioneered by [?], are a popular
approach for approximating the kernel matrix and scaling up kernel methods. The generalization ability
of such an approach is of the order of $O(1/\sqrt{N} + 1/\sqrt{m})$, which suggests that $m$ needs to be $O(N)$. An
elegant doubly stochastic gradient approach introduced recently in [?] uses random features to
approximate the function space rather than the kernel matrix, in a memory-efficient way that achieves
this $O(N)$ bound for the number of features. Maxout random features can also be used within this
framework. Other approaches for scaling up kernel methods fall under the category of low-rank
approximation of the kernel matrix, such as sparse greedy matrix approximation [?], Nyström
approximations [20] and low-rank Cholesky decomposition [?].

Locally Linear Estimation. As discussed earlier, random maxout features allow us to do
local linear estimation of functions in supervised and unsupervised learning tasks. Among other
approaches, Deep Maxout Networks [?], Locally Linear Embedding [?] and convex piecewise linear
fitting [?] share a similar structure with random maxout features.

5 Numerical Experiments 

5.1 Simulated Data Illustration 

In this section we consider 100 points generated at random from the unit circle in two dimensions.
We embed those points through the maxout feature map $\Phi$, for $m = 1000$ and two pool sizes $q$ (both powers of two).
We plot in Figure 1, for pairs of points $x$ and $z$, the pairwise distances in the embedded
space $\|\Phi(x) - \Phi(z)\|$ versus the pairwise distances in the original space $\|x - z\|$ (we show here only a
subset of those pairwise distances). We see that in both cases, for small distances we have a linear
regime, followed by a saturation regime for large distances. The saturation arises earlier for
the larger value of $q$. This confirms the discussion in Section 2.1 on the
effect of the maxout feature map as a locally linear kernel, with the radius of the locality shrinking
as $q$ increases.

Figure 1: The locality induced by the size of the pool $q$. As $q$ increases the locality radius shrinks.

To further illustrate this local linear embedding, we consider a dataset of 33 images
of faces of the same person, of size 112 x 92, at different angles, which we normalize to unit norm
(see Figure 2). The faces are ordered by their angles; the ordering provided in this dataset is noisy.
We extract the maxout features on this dataset for $m = 10000$, $q = 12$, and perform principal
component analysis on the data in the maxout feature space, projecting it down to two dimensions
on the two largest principal components. We show in Figure 2 the embedding of this dataset in
two dimensions through maxout followed by a linear PCA. Each point in this scatter plot corresponds
to a face; the numbering refers to the corresponding order in the given angle labeling. We see that
the maxout local linear embedding self-organizes the data with respect to the angle of variation, and
corrects the noisy labeling.
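
The unit-circle experiment can be reproduced in spirit with the following sketch. It is not the authors' code: the feature count ($m = 300$), the pool size ($q = 16$) and the comparison of the closest and farthest tenth of pairs are assumptions made to keep the run short and the output easy to read.

```python
import math
import random


def maxout_features(x, weights):
    """Phi(x) with the 1/sqrt(m) scaling assumed above."""
    scale = 1.0 / math.sqrt(len(weights))
    return [scale * max(sum(wi * xi for wi, xi in zip(w, x)) for w in pool)
            for pool in weights]


def euclidean(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


rng = random.Random(0)
m, q, n_points = 300, 16, 100
weights = [[[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(q)]
           for _ in range(m)]
angles = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_points)]
points = [[math.cos(t), math.sin(t)] for t in angles]
feats = [maxout_features(p, weights) for p in points]

# (original distance, embedded distance) for every pair, sorted by original distance
pairs = sorted((euclidean(points[i], points[j]), euclidean(feats[i], feats[j]))
               for i in range(n_points) for j in range(i + 1, n_points))

tenth = len(pairs) // 10
near_avg = sum(e for _, e in pairs[:tenth]) / tenth
far_avg = sum(e for _, e in pairs[-tenth:]) / tenth
print(f"mean embedded distance: closest pairs {near_avg:.3f}, farthest pairs {far_avg:.3f}")
```

Plotting the second component of each pair against the first gives a curve of the kind described above: an initial roughly linear rise followed by a plateau.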


5.2 Supervised Learning Applications: Classification

In order to perform classification as discussed earlier, we lift the data through the maxout feature
map $\Phi$, and solve a linear classification problem in the lifted space as described in Equation (3).


5.2.1 Digit Classification

We extract the maxout random features on the MNIST dataset [?], consisting of 60000 training
examples ($d = 784$) and 10000 test examples among $T = 10$ digits. Let $Z = \Phi(X) \in \mathbb{R}^{N \times m}$ be
the embedded data and $Y \in \mathbb{R}^{N \times T}$ be the class labels using a ±1 encoding. We solve the
multi-class problem using a simple regularized least squares, $f(x) = \langle \alpha, \Phi(x) \rangle$, where
$\alpha = (Z^\top Z + \lambda I)^{-1} Z^\top Y$, and $\lambda$ is the regularization parameter, chosen on a hold-out validation set.
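
The regularized least squares step can be sketched on a toy problem. The example below is an illustration under stated assumptions, not the MNIST pipeline: it uses a single ±1 output column instead of $T$ classes, synthetic points on the unit circle labeled by quadrant (a task no linear classifier solves), a hand-written Gaussian elimination solver, and arbitrary values $m = 40$, $q = 4$, $\lambda = 10^{-3}$.

```python
import math
import random


def maxout_features(x, weights):
    scale = 1.0 / math.sqrt(len(weights))
    return [scale * max(sum(wi * xi for wi, xi in zip(w, x)) for w in pool)
            for pool in weights]


def solve_linear_system(a, b):
    """Solve a @ v = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    v = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * v[c] for c in range(r + 1, n))
        v[r] = (aug[r][n] - tail) / aug[r][r]
    return v


def fit_rls(z_rows, y, lam):
    """alpha = (Z^T Z + lam I)^{-1} Z^T y for a single +/-1 output column."""
    m = len(z_rows[0])
    gram = [[sum(row[i] * row[j] for row in z_rows) + (lam if i == j else 0.0)
             for j in range(m)] for i in range(m)]
    rhs = [sum(row[i] * yi for row, yi in zip(z_rows, y)) for i in range(m)]
    return solve_linear_system(gram, rhs)


def make_data(n, rng):
    """Points on the unit circle, labeled +1 in quadrants 1 and 3, -1 otherwise."""
    xs, ys = [], []
    for _ in range(n):
        t = rng.uniform(0.0, 2.0 * math.pi)
        xs.append([math.cos(t), math.sin(t)])
        ys.append(1.0 if math.sin(2.0 * t) > 0 else -1.0)
    return xs, ys


rng = random.Random(0)
weights = [[[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(4)]
           for _ in range(40)]
x_train, y_train = make_data(300, rng)
x_test, y_test = make_data(200, rng)
alpha = fit_rls([maxout_features(x, weights) for x in x_train], y_train, lam=1e-3)
preds = [sum(a * f for a, f in zip(alpha, maxout_features(x, weights)))
         for x in x_test]
accuracy = sum((p > 0) == (y > 0) for p, y in zip(preds, y_test)) / len(y_test)
print(f"test accuracy: {accuracy:.2f}")
```

For the multi-class case described in the text, the same solve would be repeated for each of the $T$ columns of $Y$, with the predicted class given by the largest output.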
Figure 2: Maxout Locally Linear Embedding in 2D (Maxout-LLE) of 33 faces of the same person
at different angles.
          m = 100          m = 500        m = 1000       m = 5000       m = 10000
q = 1     17.68 ± 0.28     14.76 ± 0.23   14.64 ± 0.08   15.02 ± 0.64   15.12 ± 0.58
q = 2     16.95 ± 0.5321   7.98 ± 0.22    5.62 ± 0.18    2.86 ± 0.11    2.35 ± 0.04
q = 4     18.24 ± 0.39     7.70 ± 0.24    5.50 ± 0.16    2.78 ± 0.07    2.23 ± 0.07
q = 8     18.9 ± 0.28      7.98 ± 0.47    5.59 ± 0.19    2.81 ± 0.10    2.32 ± 0.09
q = 16    20.57 ± 0.76     8.19 ± 0.21    5.66 ± 0.15    2.87 ± 0.12    2.58 ± 0.04

Table 1: Random Maxout on MNIST; error rates in %.


In Table 1 we report test errors and standard deviations for various values of $m$ and $q$ in the maxout
feature map, averaged over 5 different choices of the random weights in the map. We see that for
$q = 1$, where the feature map does not introduce any non-linearity, the performance of the map
for any value of $m$ matches the error rate of a linear classifier, which is 15%. For $q \neq 1$, we start to
see the non-linearity introduced by the map as a local linear estimator: for a fixed $q$ the error rate
decreases as $m$ gets large. In this experiment the best error rate is achieved for $m = 10000$ and
$q = 4$, suggesting that $q = 4$ sets the optimal radius of locality for classification. As a baseline, an
optimal $k$-nearest neighbor classifier achieves an error rate of 3.09%.


5.2.2 Phone Classification on TIMIT 


We further evaluated random maxout features on the TIMIT speech phone classification task.
Evaluations are reported on the core test set of TIMIT. We utilized essentially the same experimental
setup as in [?]: 147 context-independent states were used as classification targets; at test time each
utterance was decoded using the Viterbi algorithm, and then mapped, as is standard, down to 39
phones for scoring. As in [?], 2 million frames of training data were utilized: fMLLR features of
dimension 40 each [?], spliced with ±5 frames of context ($d = 11 \times 40 = 440$). These features
were then lifted through the random maxout map $\Phi$, and a multinomial logistic regression was
subsequently trained using SGD to minimize the cross-entropy loss. Table 2 reports the mean and standard
deviation of the performance of random maxout units as a function of the number of maxout features,
$m$, and the number of projections per feature, $q$. Interestingly, even smaller feature maps far outperform
using the raw features, and the performance varies very little with the initialization seed (5 seeds per result).


                        q
m         2             4             8             16
1250      24.7 ± 0.2    24.4 ± 0.2    25.6 ± 0.2    26.0 ± 0.2
2500      24.0 ± 0.2    23.3 ± 0.3    24.7 ± 0.3    25.3 ± 0.3
5000      23.5 ± 0.1    22.9 ± 0.2    24.7 ± 0.2    24.7 ± 0.4
10000     23.2 ± 0.1    22.5 ± 0.2    24.5 ± 0.2    24.7 ± 0.4
20000     23.1 ± 0.1    22.3 ± 0.2    24.3 ± 0.2    24.5 ± 0.2

Table 2: Phone error rate (PER, %) as a function of the number of maxout features, m, and the number of
linear projections per maxout feature, q, on the TIMIT speech phone classification task. Multinomial
logistic regression on the input features yields a PER of 33.1 ± 0.1%.


Table 3 summarizes preliminary investigations into scaling up the size of the feature map, where,
to increase the number of features, projections are shared across random maxout units. Random
maxout features appear to perform similarly to random Fourier features on the task.
network                 # features (m)   # projections          # proj./feature (q)   phone error rate (PER)
Random Maxout           15K              15K                    q=4                   23.1
Random Maxout           60K              15K                    q=4                   22.7
Random Maxout           60K              60K                    q=4                   22.8
Random Maxout           400K             15K                    q=4                   22.4
Random Maxout           300K             300K                   q=4                   22.1
Random Fourier          400K             400K                   -                     21.3 [5]
ReLU DNN                4K               16K (4K x 4 layers)    -                     22.7 [25]
ReLU DNN w/ dropout     4K               16K (4K x 4 layers)    -                     19.7 [25]

Table 3: Phone error rates (PER, %) on TIMIT. The total number of projections used to produce each
feature map is as indicated (here random maxout features draw from a shared pool of projections).
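
One way projections could be shared across units is sketched below. The paper does not specify the sharing rule, so the scheme here (each unit samples $q$ distinct indices from one common pool) is a hypothetical choice, as are the names and sizes; the point is only that the dot products are computed once per pool entry, not once per unit.

```python
import math
import random


def shared_maxout_features(x, pool, units):
    """Each unit takes the max over its q indices into one shared projection pool."""
    projections = [sum(wi * xi for wi, xi in zip(w, x)) for w in pool]  # computed once
    scale = 1.0 / math.sqrt(len(units))
    return [scale * max(projections[j] for j in unit) for unit in units]


rng = random.Random(0)
d, n_proj, m, q = 10, 15, 60, 4
pool = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n_proj)]
units = [rng.sample(range(n_proj), q) for _ in range(m)]  # hypothetical assignment rule
x = [rng.gauss(0.0, 1.0) for _ in range(d)]
features = shared_maxout_features(x, pool, units)
print(len(features), "features from", len(pool), "projections")
```

Under this scheme the cost of the projection step depends on the pool size rather than on $m \times q$, which is the kind of saving the 60K-from-15K rows of the table suggest.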


In this paper we presented the random maxout feature map as an effective and scalable local linear
estimator, and derived risk bounds for learning in this feature space that assess both statistical
and approximation errors in a classification setting. We believe that maxout features, thanks to
their conditionally linear structure, can gain further in scalability and speed by leveraging the fast
Johnson-Lindenstrauss transform and the doubly stochastic optimization framework of [?].

A Proof of Theorem 1 

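Proof of Theorem 1. Assume without loss of generality that $\|x\| = \|z\| = 1$. Let $D(x) =
\arg\max_{j = 1 \dots q} \langle w_j, x \rangle$ and $D(z) = \arg\max_{j = 1 \dots q} \langle w_j, z \rangle$, where ties are broken arbitrarily. By total
probability we have:

$$\begin{aligned}
K(x, z) &= \mathbb{E}(h(x) h(z)) \\
&= \mathbb{E}(h(x) h(z) \mid D(x) = D(z))\, \mathbb{P}(D(x) = D(z)) \\
&\quad + \mathbb{E}(h(x) h(z) \mid D(x) \neq D(z))\, \mathbb{P}(D(x) \neq D(z)) \\
&= \mathbb{E}(\langle w_{D(x)}, x \rangle \langle w_{D(x)}, z \rangle \mid D(x) = D(z))\, \mathbb{P}(D(x) = D(z)) \\
&\quad + \mathbb{E}(\langle w_{D(x)}, x \rangle \langle w_{D(z)}, z \rangle \mid D(x) \neq D(z))\, \mathbb{P}(D(x) \neq D(z)).
\end{aligned}$$

It is easy to see that the second term in this sum is zero since the Gaussians are independent and zero
centered: $\mathbb{E}(\langle w_{D(x)}, x \rangle \langle w_{D(z)}, z \rangle \mid D(x) \neq D(z)) = 0$. We are left with the first term of this
sum:

$$\begin{aligned}
K(x, z) &= \mathbb{E}(h(x) h(z)) \\
&= \mathbb{E}(\langle w_{D(x)}, x \rangle \langle w_{D(x)}, z \rangle \mid D(x) = D(z))\, \mathbb{P}(D(x) = D(z)) \\
&= q\, \mathbb{E}(\langle w_1, x \rangle \langle w_1, z \rangle \mid D(x) = D(z) = 1)\, \mathbb{P}(D(x) = D(z) = 1) \\
&= \mathbb{E}(\langle w_1, x \rangle \langle w_1, z \rangle \mid D(x) = D(z) = 1)\, \mathbb{P}(D(x) = D(z)).
\end{aligned}$$

By rotation invariance of Gaussians we have:

$$\langle w_1, x \rangle = g \quad \text{and} \quad \langle w_1, z \rangle = \langle x, z \rangle\, g + \sqrt{1 - |\langle x, z \rangle|^2}\, h,$$

where $g$ and $h$ are independent random Gaussian variables, $g, h \sim \mathcal{N}(0, 1)$.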
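
The rotation-invariance decomposition can be checked by simulation for a single Gaussian vector. The sketch below, with arbitrarily chosen unit vectors satisfying $\langle x, z \rangle = 0.6$, verifies that the two projections have covariance $\langle x, z \rangle$ and that the residual playing the role of $h$ is uncorrelated with $g$.

```python
import math
import random

rng = random.Random(0)
x = [1.0, 0.0, 0.0]
z = [0.6, 0.8, 0.0]          # unit norm, <x, z> = 0.6
rho = sum(a * b for a, b in zip(x, z))

n = 20000
sum_prod = 0.0
sum_resid_g = 0.0
for _ in range(n):
    w = [rng.gauss(0.0, 1.0) for _ in range(3)]
    g = sum(wi * xi for wi, xi in zip(w, x))              # <w, x>
    wz = sum(wi * zi for wi, zi in zip(w, z))             # <w, z>
    resid = (wz - rho * g) / math.sqrt(1.0 - rho ** 2)    # plays the role of h
    sum_prod += g * wz
    sum_resid_g += g * resid

cov_estimate = sum_prod / n      # should be close to rho
resid_corr = sum_resid_g / n     # should be close to 0
print(f"E[<w,x><w,z>] ~ {cov_estimate:.3f}, E[g h] ~ {resid_corr:.3f}")
```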

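Let $E$ be the following event:

$$E = \{ g \text{ is the maximum of } q \text{ independent Gaussians} \}.$$

Hence we have:

$$\mathbb{E}(\langle w_1, x \rangle \langle w_1, z \rangle \mid D(x) = D(z) = 1) = \mathbb{E}\Big( g \big( \langle x, z \rangle\, g + \sqrt{1 - |\langle x, z \rangle|^2}\, h \big) \,\Big|\, E \Big) = \langle x, z \rangle\, \mathbb{E}\Big[ \big( \max_{j = 1 \dots q} g_j \big)^2 \Big]. \qquad (6)$$

Let $\sigma^2(q) = \mathbb{E}\big[ (\max_{j = 1 \dots q} g_j)^2 \big]$; we have finally:

$$\mathbb{E}(h(x) h(z)) = \sigma^2(q) \langle x, z \rangle\, \mathbb{P}(D(x) = D(z)).$$

$\sigma^2(q)$ is a normalization factor and it is well known that $\sigma^2(q) \sim \log(q)$; hence we are left with

$$\mathbb{P}(D(x) = D(z)),$$

that is, the probability that $x$ and $z$ are not separated by the $q$ hyperplanes, an object that is well
studied in $q$-way graph cut approximation algorithms.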

The following lemma is crucial to our proof; it is proved in [26] and allows us to get the final
expression of the expected kernel.

Lemma 1 ([26]). For $x, z \in \mathbb{R}^d$ with $\|x\| = \|z\| = 1$, let $\rho = \langle x, z \rangle$. We have:

$$K_q(x, z) = \mathbb{P}(D(x) = D(z)) = \sum_{i = 0}^{\infty} a_i(q) \rho^i. \qquad (7)$$

The Taylor series of $K_q$ around $\rho = 0$ converges for all $\rho$ in the range $|\rho| \leq 1$. The coefficients
$a_i(q)$ of the expansion are all non-negative and their sum converges to 1. The first three coefficients
are $a_0(q) = \frac{1}{q}$, $a_1(q) = \frac{h_1^2(q)}{q - 1}$, $a_2(q) = \frac{q\, h_2^2(q)}{(q - 1)(q - 2)}$, with $h_i(q) = \mathbb{E}(\phi_i(\max_{k = 1 \dots q} g_k))$, where $g_j$, $j =
1 \dots q$, are iid standard centered Gaussians and $\phi_i$ are the normalized Hermite polynomials.

By Lemma 1 we have finally:

$$K(x, z) = \mathbb{E}(h(x) h(z)) = \sigma^2(q) \langle x, z \rangle K_q(x, z), \qquad (8)$$

where $K_q(x, z) = \sum_{i \geq 0} a_i(q) \langle x, z \rangle^i$ is a non-linear kernel; the values of $a_i(q)$ are given in the above
lemma. □

B Learning with Random Maxout Features 


In this section we state the proof of Theorem 2. We start with a preliminary Lemma that bounds 
(j>{x, W) uniformly on the set X, this will be crucial in our derivations. 

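Lemma 2 (Bounding $\sup_{x \in \mathcal{M}} |\phi(x, W)|$). Let $\mathcal{X} = \mathcal{M}$. Let $d_{\mathcal{M}}$ be the Assouad dimension of $\mathcal{M}$ and $\mathrm{diam}(\mathcal{M})$ be the diameter of $\mathcal{M}$. Let $\delta > 0$; we have, for a numeric constant $C_1$:

\[ \sup_{x \in \mathcal{M}} |\phi(x, W)| = \sup_{x \in \mathcal{M}} \Big| \max_{j=1\ldots q} \langle w_j, x\rangle \Big| \le C_1 \sqrt{ d_{\mathcal{M}} \log\Big( \frac{\mathrm{diam}(\mathcal{M}) \sqrt{d}}{\delta} \Big) + \log(q + 1) }, \]

with probability at least $1 - \delta - 2e^{-cd/4}$.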


Proof. Consider an $\epsilon$-net that covers $\mathcal{X}$ with balls of radius $r$ and centers $\{x_i\}_{i=1\ldots T}$. By definition of the Assouad dimension, the maximum number of balls $T$ is less than $\big(\frac{2\,\mathrm{diam}(\mathcal{M})}{r}\big)^{d_{\mathcal{M}}}$. Assume we have $|\phi(x, W) - \phi(z, W)| = \langle w_{D(x)}, x\rangle - \langle w_{D(z)}, z\rangle$, meaning $\langle w_{D(x)}, x\rangle - \langle w_{D(z)}, z\rangle \ge 0$. Then:

\[ \phi(x, W) - \phi(z, W) = \langle w_{D(x)}, x\rangle - \langle w_{D(z)}, z\rangle = \langle w_{D(x)}, x - z\rangle - \underbrace{\langle w_{D(z)} - w_{D(x)}, z\rangle}_{\ge 0} \le \|w_{D(x)}\|_2 \|x - z\|_2, \]

where the inequality follows from the definition of $D(z)$ and the Cauchy-Schwarz inequality. Similarly, if we have $|\phi(x, W) - \phi(z, W)| = \langle w_{D(z)}, z\rangle - \langle w_{D(x)}, x\rangle$, we have:

\[ |\phi(x, W) - \phi(z, W)| \le \|w_{D(z)}\|_2 \|x - z\|_2. \]

We conclude therefore that:

\[ |\phi(x, W) - \phi(z, W)| \le \max\big(\|w_{D(x)}\|_2, \|w_{D(z)}\|_2\big) \|x - z\|_2 \le \big(\|w_{D(x)}\|_2 + \|w_{D(z)}\|_2\big) \|x - z\|_2. \]

Let $L = \|w_{D(x)}\|_2 + \|w_{D(z)}\|_2$.

Let $t > 0$; we have $\sup_{x \in \mathcal{M}} |\phi(x, W)| < t$ if the following two events hold:

\[ E_1 = \Big\{ \sup_{x_i,\, i=1\ldots T} |\phi(x_i, W)| < \frac{t}{2} \Big\}, \qquad E_2 = \Big\{ L < \frac{t}{2r} \Big\}. \]



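On the one hand:

\[ \mathbb{P}(E_1^c) = \mathbb{P}\Big( \sup_{i=1\ldots T} |\phi(x_i, W)| \ge \frac{t}{2} \Big) = \mathbb{P}\Big( \bigcup_{i=1}^{T} \Big\{ |\phi(x_i, W)| \ge \frac{t}{2} \Big\} \Big) \le \sum_{i=1}^{T} \mathbb{P}\Big( |\phi(x_i, W)| \ge \frac{t}{2} \Big) = T\, \mathbb{P}\Big( \Big| \max_{j=1\ldots q} \langle w_j, x\rangle \Big| \ge \frac{t}{2} \Big). \]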


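Note that by a union bound we have:

\[ \mathbb{P}\Big( \max_{j=1\ldots q} \langle w_j, x\rangle \ge \frac{t}{2} \Big) = \mathbb{P}\Big( \exists j,\ \langle w_j, x\rangle \ge \frac{t}{2} \Big) \le q\, \mathbb{P}\Big( \langle w, x\rangle \ge \frac{t}{2} \Big) \le q\, e^{-t^2/8}, \]

and by independence of the $w_j$ we also have:

\[ \mathbb{P}\Big( \max_{j=1\ldots q} \langle w_j, x\rangle \le -\frac{t}{2} \Big) = \mathbb{P}\Big( \forall j,\ \langle w_j, x\rangle \le -\frac{t}{2} \Big) = \prod_{j=1}^{q} \mathbb{P}\Big( \langle w, x\rangle \le -\frac{t}{2} \Big) \le e^{-q t^2/8}. \]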


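Putting together these two bounds we have:

\[ \mathbb{P}\Big( \Big| \max_{j=1\ldots q} \langle w_j, x\rangle \Big| \ge \frac{t}{2} \Big) \le q\, e^{-t^2/8} + e^{-q t^2/8}. \]

The covering number $T$ of $\mathcal{X}$ is also bounded as follows:

\[ T \le \Big( \frac{2\,\mathrm{diam}(\mathcal{M})}{r} \Big)^{d_{\mathcal{M}}}. \]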
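The two tail bounds used here can be checked against the exact distribution of the maximum of $q$ independent standard Gaussians. The sketch below assumes the standard Gaussian tail bound $\mathbb{P}(g \ge s) \le e^{-s^2/2}$ with $s = t/2$, which gives the exponent $t^2/8$; the function names are hypothetical and only serve this illustration.

```python
import math

def std_normal_cdf(s):
    """Standard normal CDF, written with the complementary error function."""
    return 0.5 * math.erfc(-s / math.sqrt(2.0))

def upper_tail_of_max(q, s):
    """Exact P(max_j g_j >= s) for q i.i.d. standard Gaussians."""
    return 1.0 - std_normal_cdf(s) ** q

def lower_tail_of_max(q, s):
    """Exact P(max_j g_j <= -s): all q Gaussians must fall below -s."""
    return std_normal_cdf(-s) ** q

def combined_bound(q, t):
    """Union bound plus independence bound at level s = t / 2."""
    s = t / 2.0
    return q * math.exp(-s * s / 2.0) + math.exp(-q * s * s / 2.0)

def exact_two_sided(q, t):
    """Exact P(|max_j g_j| >= t / 2)."""
    return upper_tail_of_max(q, t / 2.0) + lower_tail_of_max(q, t / 2.0)
```

For any $q \ge 1$ and $t \ge 0$, `exact_two_sided(q, t)` stays below `combined_bound(q, t)`, which in turn is at most $(q+1)e^{-t^2/8}$.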




Hence we have for $q \ge 1$:

\[ \mathbb{P}(E_1^c) \le \Big( \frac{2\,\mathrm{diam}(\mathcal{M})}{r} \Big)^{d_{\mathcal{M}}} (q + 1)\, e^{-t^2/8}. \]




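On the other hand, for a universal constant $c$ and for $\epsilon \in (0, 1)$:

\[ \mathbb{P}\big( \|w\|_2 \ge \sqrt{d}(1 + \epsilon) \big) \le e^{-c\epsilon^2 d}. \]

Set $\frac{t}{2r} = \sqrt{d}(1 + \epsilon)$; hence $\mathbb{P}(E_2^c) \le 2e^{-c\epsilon^2 d}$.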
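The concentration of the norm of a Gaussian vector around $\sqrt{d}$ can be seen in a small simulation. This is only a sketch: the function name `norm_exceedance`, the choice $\epsilon = 0.5$, and the dimensions are assumptions for illustration, and the simulation does not identify the constant $c$.

```python
import numpy as np

def norm_exceedance(d, eps, n_samples=20_000, seed=0):
    """Empirical frequency of the event ||w||_2 >= sqrt(d) * (1 + eps), w ~ N(0, I_d)."""
    rng = np.random.default_rng(seed)
    norms = np.linalg.norm(rng.standard_normal((n_samples, d)), axis=1)
    return float(np.mean(norms >= np.sqrt(d) * (1.0 + eps)))

# The exceedance frequency should shrink quickly as the dimension d grows.
fractions = {d: norm_exceedance(d, eps=0.5) for d in (4, 16, 64)}
```

The rapid decay of `fractions` with `d` is consistent with a bound of the form $e^{-c\epsilon^2 d}$.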


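It follows that for $t \ge 1$:

\[ \mathbb{P}\Big( \sup_{x \in \mathcal{M}} |\phi(x, W)| \ge t \Big) \le \mathbb{P}(E_1^c \cup E_2^c) \le \mathbb{P}(E_1^c) + \mathbb{P}(E_2^c) \]

\[ \le \Big( \frac{4\sqrt{d}(1 + \epsilon)\,\mathrm{diam}(\mathcal{M})}{t} \Big)^{d_{\mathcal{M}}} (q + 1)\, e^{-t^2/8} + 2e^{-c\epsilon^2 d} \]

\[ \le \big( 4\sqrt{d}(1 + \epsilon)\,\mathrm{diam}(\mathcal{M}) \big)^{d_{\mathcal{M}}} (q + 1)\, e^{-t^2/8} + 2e^{-c\epsilon^2 d}. \]

Hence for $\epsilon = \frac{1}{2}$ and $t \ge 1$:

\[ \sup_{x \in \mathcal{M}} |\phi(x, W)| < t, \]

with probability at least $1 - \big( 6\,\mathrm{diam}(\mathcal{M})\sqrt{d} \big)^{d_{\mathcal{M}}} (q + 1)\, e^{-t^2/8} - 2e^{-cd/4}$.

Hence we have for a numeric constant $C_1$:

\[ \sup_{x \in \mathcal{M}} |\phi(x, W)| \le C_1 \sqrt{ d_{\mathcal{M}} \log\Big( \frac{\mathrm{diam}(\mathcal{M}) \sqrt{d}}{\delta} \Big) + \log(q + 1) }, \]

with probability at least $1 - \delta - 2e^{-cd/4}$.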
□ 
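The passage from the probability statement in $t$ to the final bound in $\delta$ can be spelled out as follows. This is a sketch under the assumption that the first failure term is $(6\,\mathrm{diam}(\mathcal{M})\sqrt{d})^{d_{\mathcal{M}}}(q+1)e^{-t^2/8}$ and that numeric constants are not tracked carefully; it is not claimed to be the authors' exact computation.

```latex
% Assumption: set the first failure term equal to delta and solve for t.
\begin{align*}
\big(6\,\mathrm{diam}(\mathcal{M})\sqrt{d}\big)^{d_{\mathcal{M}}}(q+1)\,e^{-t^2/8} = \delta
&\iff \frac{t^2}{8} = d_{\mathcal{M}} \log\big(6\,\mathrm{diam}(\mathcal{M})\sqrt{d}\big) + \log(q+1) + \log\frac{1}{\delta} \\
&\iff t = \sqrt{8\Big(d_{\mathcal{M}} \log\big(6\,\mathrm{diam}(\mathcal{M})\sqrt{d}\big) + \log(q+1) + \log\tfrac{1}{\delta}\Big)}.
\end{align*}
% For d_M >= 1 and 0 < delta < 1, log(1/delta) <= d_M log(1/delta), hence
\[
t^2 \le 8\Big(d_{\mathcal{M}} \log\frac{6\,\mathrm{diam}(\mathcal{M})\sqrt{d}}{\delta} + \log(q+1)\Big).
\]
```

This matches the stated form up to the numeric constant $C_1$; absorbing the factor 6 inside the logarithm into $C_1$ requires $\mathrm{diam}(\mathcal{M})\sqrt{d}/\delta$ to be bounded away from 1.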


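The following lemma shows that any function $f \in \mathcal{F}$ can be approximated by a function $\hat{f} \in \hat{\mathcal{F}}$:

Lemma 3 (Approximation Error). Let $f$ be a function in $\mathcal{F}$. Then for $\delta > 0$, there exists a function $\hat{f} \in \hat{\mathcal{F}}$ such that:

\[ \big\| f - \hat{f} \big\|_{\mathcal{L}^2(\mathcal{X}, \rho_{\mathcal{M}})} \le C C_1 \sqrt{ \frac{ d_{\mathcal{M}} \log\big( \frac{\mathrm{diam}(\mathcal{M}) \sqrt{d}}{\delta} \big) + \log(q + 1) }{m} } \Big( 1 + \sqrt{2 \log\big(\tfrac{1}{\delta}\big)} \Big), \]

with probability at least $1 - 2\delta - 2e^{-cd/4}$.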

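Proof of Lemma 3. Let $f \in \mathcal{F}$, $f(x) = \int \alpha(W) \phi(x, W)\, dW$. Let $f_\ell(x) = \frac{\alpha(W^\ell)}{p(W^\ell)} \phi(x, W^\ell)$, where $p$ denotes the density of the random projections. We have the following: $\mathbb{E}_W(f_\ell) = f$, and $\mathbb{E}_W\big(\frac{1}{m} \sum_{\ell=1}^{m} f_\ell\big) = f$. Consider the Hilbert space $\mathcal{L}^2(\mathcal{X}, \rho_{\mathcal{M}})$, with dot product $\langle f, g\rangle_{\mathcal{L}^2(\mathcal{X}, \rho_{\mathcal{M}})} = \int_{\mathcal{X}} f(x) g(x)\, d\rho_{\mathcal{M}}(x)$, so that

\[ \|f_\ell\|^2_{\mathcal{L}^2(\mathcal{X}, \rho_{\mathcal{M}})} = \int_{\mathcal{X}} \Big( \frac{\alpha(W^\ell)}{p(W^\ell)} \phi(x, W^\ell) \Big)^2 d\rho_{\mathcal{M}}(x). \]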

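Let $E$ and $F$ be the events defined as follows:

\[ E = \Big\{ \sup_{x \in \mathcal{M}} |\phi(x, W)| \le M \Big\}, \qquad F = \Big\{ \Big\| \frac{1}{m} \sum_{\ell=1}^{m} f_\ell - f \Big\|_{\mathcal{L}^2(\mathcal{X}, \rho_{\mathcal{M}})} \ge t \Big\}. \]

Conditioned on $E$ we have:

\[ \|f_\ell\|_{\mathcal{L}^2(\mathcal{X}, \rho_{\mathcal{M}})} \le CM. \]

\[ \mathbb{P}(F) = \mathbb{P}(F \mid E)\, \mathbb{P}(E) + \mathbb{P}(F \mid E^c)\, \mathbb{P}(E^c) \le \mathbb{P}(F \mid E) + \mathbb{P}(E^c). \]

Conditioned on the event $E$, we can apply McDiarmid's inequality to bound $\mathbb{P}(F \mid E)$.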

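For $\delta > 0$ set $M = C_1 \sqrt{ d_{\mathcal{M}} \log\big( \frac{\mathrm{diam}(\mathcal{M}) \sqrt{d}}{\delta} \big) + \log(q + 1) }$; applying Lemma 2 we have:

\[ \mathbb{P}(E) \ge 1 - \delta - 2e^{-cd/4}. \]

We have therefore, with probability $1 - \delta - 2e^{-cd/4} - \delta_1$:

\[ \Big\| \frac{1}{m} \sum_{\ell=1}^{m} f_\ell - f \Big\|_{\mathcal{L}^2(\mathcal{X}, \rho_{\mathcal{M}})} \le C C_1 \sqrt{ \frac{ d_{\mathcal{M}} \log\big( \frac{\mathrm{diam}(\mathcal{M}) \sqrt{d}}{\delta} \big) + \log(q + 1) }{m} } \Big( 1 + \sqrt{2 \log\big(\tfrac{1}{\delta_1}\big)} \Big). \qquad (9) \]

Taking $\delta_1 = \delta$ gives the claim.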

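□

The following lemma shows how the approximation of functions in $\mathcal{F}$ by functions in $\hat{\mathcal{F}}$ transfers to the expected risk:

Lemma 4 (Bound on the Approximation Error). Let $f \in \mathcal{F}$, and fix $\delta > 0$. There exists a function $\hat{f} \in \hat{\mathcal{F}}$ such that:

\[ \mathcal{E}_V(\hat{f}) \le \mathcal{E}_V(f) + L C C_1 \sqrt{ \frac{ d_{\mathcal{M}} \log\big( \frac{\mathrm{diam}(\mathcal{M}) \sqrt{d}}{\delta} \big) + \log(q + 1) }{m} } \Big( 1 + \sqrt{2 \log\big(\tfrac{1}{\delta}\big)} \Big), \]

with probability at least $1 - 2\delta - 2e^{-cd/4}$.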


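Proof of Lemma 4.

\[ \mathcal{E}_V(\hat{f}) - \mathcal{E}_V(f) \le \int_{\mathcal{X}} \big| V(y \hat{f}(x)) - V(y f(x)) \big|\, d\rho_{\mathcal{M}}(x) \le L \int_{\mathcal{X}} \big| \hat{f}(x) - f(x) \big|\, d\rho_{\mathcal{M}}(x) \le L \sqrt{ \int_{\mathcal{X}} \big( \hat{f}(x) - f(x) \big)^2 d\rho_{\mathcal{M}}(x) } = L \big\| \hat{f} - f \big\|_{\mathcal{L}^2(\mathcal{X}, \rho_{\mathcal{M}})}, \]

where we used the Lipschitz condition and Jensen's inequality. The rest of the proof follows from Lemma 3. □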


We are now ready to prove Theorem 2.


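Proof of Theorem 2. Let $\hat{f}_N = \arg\min_{f \in \hat{\mathcal{F}}} \hat{\mathcal{E}}_V(f)$, $\hat{f} = \arg\min_{f \in \hat{\mathcal{F}}} \mathcal{E}_V(f)$, $f^* = \arg\min_{f \in \mathcal{F}} \mathcal{E}_V(f)$.

\[ \mathcal{E}_V(\hat{f}_N) - \min_{f \in \mathcal{F}} \mathcal{E}_V(f) = \underbrace{\big( \mathcal{E}_V(\hat{f}_N) - \mathcal{E}_V(\hat{f}) \big)}_{\text{Statistical Error}} + \underbrace{\big( \mathcal{E}_V(\hat{f}) - \mathcal{E}_V(f^*) \big)}_{\text{Approximation Error}}. \]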


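Bounding the statistical error. The first term is the usual estimation or statistical error, which we can bound as follows:

\[ \mathcal{E}_V(\hat{f}_N) - \mathcal{E}_V(\hat{f}) = \big( \mathcal{E}_V(\hat{f}_N) - \hat{\mathcal{E}}_V(\hat{f}_N) \big) + \underbrace{\big( \hat{\mathcal{E}}_V(\hat{f}_N) - \hat{\mathcal{E}}_V(\hat{f}) \big)}_{\le 0,\ \text{by optimality of } \hat{f}_N} + \big( \hat{\mathcal{E}}_V(\hat{f}) - \mathcal{E}_V(\hat{f}) \big) \le 2 \sup_{f \in \hat{\mathcal{F}}} \big| \mathcal{E}_V(f) - \hat{\mathcal{E}}_V(f) \big|. \]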


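Assume that the loss $V : \mathbb{R} \to [0, 1]$. When one of the data points $(x_i, y_i)$ or the random projections changes, $\sup_{f \in \hat{\mathcal{F}}} \big| \mathcal{E}_V(f) - \hat{\mathcal{E}}_V(f) \big|$ can change by no more than $\frac{2}{N}$. Then by applying McDiarmid's inequality we have, with probability at least $1 - \delta/2$:

\[ \sup_{f \in \hat{\mathcal{F}}} \big| \mathcal{E}_V(f) - \hat{\mathcal{E}}_V(f) \big| \le \mathbb{E} \sup_{f \in \hat{\mathcal{F}}} \big| \mathcal{E}_V(f) - \hat{\mathcal{E}}_V(f) \big| + \sqrt{ \frac{2 \log(2/\delta)}{N} }. \]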


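Now using the classical Rademacher complexity type bounds [29], we have:

\[ \mathbb{E}_{x, W} \sup_{f \in \hat{\mathcal{F}}} \big| \mathcal{E}_V(f) - \hat{\mathcal{E}}_V(f) \big| \le 2 L\, \mathcal{R}_N(\hat{\mathcal{F}}) + \frac{2 |V(0)|}{\sqrt{N}}, \]

where $\mathcal{R}_N(\hat{\mathcal{F}})$ is defined as follows:

\[ \mathcal{R}_N(\hat{\mathcal{F}}) = \mathbb{E}_{x, W, \sigma} \Big[ \sup_{f \in \hat{\mathcal{F}}} \Big| \frac{1}{N} \sum_{i=1}^{N} \sigma_i f(x_i) \Big| \Big], \]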


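where $\sigma_i$ are i.i.d. Rademacher variables in $\{-1, 1\}$, such that $\mathbb{P}(\sigma_i = 1) = \mathbb{P}(\sigma_i = -1) = \frac{1}{2}$.

It is sufficient to bound the Rademacher complexity of the class $\hat{\mathcal{F}}$, where the expectation is taken over the randomness of the data and the random features:

\[ \mathcal{R}_N(\hat{\mathcal{F}}) = \mathbb{E}_{x, W, \sigma} \Big[ \sup_{f \in \hat{\mathcal{F}}} \Big| \frac{1}{N} \sum_{i=1}^{N} \sigma_i f(x_i) \Big| \Big] = \mathbb{E}_{x, W, \sigma} \Big[ \sup_{\alpha} \Big| \frac{1}{N} \sum_{i=1}^{N} \sigma_i \sum_{\ell=1}^{m} \alpha_\ell\, \phi(x_i, W^\ell) \Big| \Big] \]

\[ = \mathbb{E}_{x, W, \sigma} \Big[ \sup_{\alpha} \Big| \frac{1}{N} \sum_{\ell=1}^{m} \alpha_\ell \sum_{i=1}^{N} \sigma_i\, \phi(x_i, W^\ell) \Big| \Big] \]

\[ \le \mathbb{E}_{x, W, \sigma} \Big[ \sup_{\alpha} \|\alpha\|_\infty \frac{1}{N} \sum_{\ell=1}^{m} \Big| \sum_{i=1}^{N} \sigma_i\, \phi(x_i, W^\ell) \Big| \Big] \qquad \text{(by Hölder's inequality: } \langle a, b\rangle \le \|a\|_\infty \|b\|_1 \text{)} \]

\[ \le \frac{C}{mN}\, \mathbb{E}_{x, W} \sum_{\ell=1}^{m} \sqrt{ \mathbb{E}_\sigma \Big( \sum_{i=1}^{N} \sigma_i\, \phi(x_i, W^\ell) \Big)^2 } \qquad \text{(Jensen's inequality, concavity of the square root).} \]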

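Note that $\mathbb{E}(\sigma_i \sigma_j) = 0$ for $i \ne j$; it follows that:

\[ \mathbb{E}_\sigma \Big( \sum_{i=1}^{N} \sigma_i\, \phi(x_i, W^\ell) \Big)^2 = \mathbb{E}_\sigma \sum_{i=1}^{N} \sum_{j=1}^{N} \sigma_i \sigma_j\, \phi(x_i, W^\ell)\, \phi(x_j, W^\ell) = \sum_{i=1}^{N} \phi^2(x_i, W^\ell). \]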

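Finally:

\[ \mathcal{R}_N(\hat{\mathcal{F}}) \le \frac{C}{mN}\, \mathbb{E}_{x, W} \sum_{\ell=1}^{m} \sqrt{ \sum_{i=1}^{N} \phi^2(x_i, W^\ell) } = \frac{C}{N}\, \mathbb{E}_{x, W} \sqrt{ \sum_{i=1}^{N} \phi^2(x_i, W) } \]

\[ \le \frac{C}{N} \sqrt{ \mathbb{E}_{x, W} \sum_{i=1}^{N} \phi^2(x_i, W) } \qquad \text{(by Jensen's inequality)} \]

\[ = \frac{C}{N} \sqrt{ N\, \mathbb{E}_{x, W}\, \phi^2(x, W) } \le \frac{C}{\sqrt{N}} \sqrt{ \mathbb{E}_x (K(x, x)) }. \]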

Recall that for $x$ with $\|x\| = 1$, $K(x, x) = \sigma^2(q) \|x\|^2 K_q(x, x) = \sigma^2(q)$. Hence:

\[ \mathcal{R}_N(\hat{\mathcal{F}}) \le C \sqrt{ \frac{\sigma^2(q)}{N} }. \]
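The Rademacher complexity bound can be compared with a direct Monte Carlo estimate. The sketch below assumes the class consists of functions $\sum_\ell \alpha_\ell \phi(x, W^\ell)$ with $|\alpha_\ell| \le C/m$, for which the supremum over $\alpha$ has a closed form; the function names, the fixed unit-norm data set, and all sizes are illustrative assumptions rather than the paper's setup.

```python
import numpy as np

def maxout(X, W):
    """phi(x_i, W^l) = max_j <w^l_j, x_i>; X: (N, d), W: (m, q, d) -> (N, m)."""
    return np.einsum("mqd,nd->nmq", W, X).max(axis=2)

def rademacher_estimate(X, q, m, C=1.0, n_trials=200, seed=0):
    """Monte Carlo estimate of the Rademacher complexity for |alpha_l| <= C/m.

    For fixed signs and projections the supremum over alpha equals
    (C / (m N)) * sum_l |sum_i sigma_i phi(x_i, W^l)|.
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    total = 0.0
    for _ in range(n_trials):
        W = rng.standard_normal((m, q, d))
        sigma = rng.choice([-1.0, 1.0], size=N)
        total += C / (m * N) * np.abs(sigma @ maxout(X, W)).sum()
    return total / n_trials

def sigma2(q, n_samples=50_000, seed=1):
    """Monte Carlo estimate of E[(max_{j=1..q} g_j)^2]."""
    rng = np.random.default_rng(seed)
    return float(np.mean(rng.standard_normal((n_samples, q)).max(axis=1) ** 2))

rng = np.random.default_rng(2)
N, d, q, m = 200, 10, 4, 50
X = rng.standard_normal((N, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm inputs, so K(x, x) = sigma^2(q)

estimate = rademacher_estimate(X, q, m)          # uses C = 1
bound = float(np.sqrt(sigma2(q) / N))            # C * sqrt(sigma^2(q) / N) with C = 1
```

Since the only slack in the derivation comes from the two applications of Jensen's inequality, the estimate is expected to sit somewhat below the bound rather than far from it.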


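Hence we have, with probability $1 - \delta/2$ on the choice of random data and random projections:

\[ \mathcal{E}_V(\hat{f}_N) - \mathcal{E}_V(\hat{f}) \le 4 L C \sqrt{ \frac{\sigma^2(q)}{N} } + \frac{2 |V(0)|}{\sqrt{N}} + \sqrt{ \frac{2 \log(2/\delta)}{N} }. \qquad (10) \]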


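Bounding the Approximation Error. Let $\tilde{f}^*$ be the function defined in Lemma 3 that approximates $f^*$ in $\hat{\mathcal{F}}$. By Lemma 4 we know that:

\[ \mathcal{E}_V(\tilde{f}^*) \le \mathcal{E}_V(f^*) + L C C_1 \sqrt{ \frac{ d_{\mathcal{M}} \log\big( \frac{\mathrm{diam}(\mathcal{M}) \sqrt{d}}{\delta} \big) + \log(q + 1) }{m} } \Big( 1 + \sqrt{2 \log\big(\tfrac{1}{\delta}\big)} \Big), \]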

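with probability $1 - 2\delta - 2e^{-cd/4}$ on the choice of the random projections. By optimality of $\hat{f}$ we have, with at least the same probability $1 - 2\delta - 2e^{-cd/4}$:

\[ \mathcal{E}_V(\hat{f}) \le \mathcal{E}_V(\tilde{f}^*) \le \mathcal{E}_V(f^*) + L C C_1 \sqrt{ \frac{ d_{\mathcal{M}} \log\big( \frac{\mathrm{diam}(\mathcal{M}) \sqrt{d}}{\delta} \big) + \log(q + 1) }{m} } \Big( 1 + \sqrt{2 \log\big(\tfrac{1}{\delta}\big)} \Big). \]

Hence by a union bound, with probability $1 - 3\delta - 2e^{-cd/4}$ on the training set and the random projections:

\[ \mathcal{E}_V(\hat{f}_N) - \min_{f \in \mathcal{F}} \mathcal{E}_V(f) \le 4 L C \sqrt{ \frac{\sigma^2(q)}{N} } + \frac{2 |V(0)|}{\sqrt{N}} + \sqrt{ \frac{2 \log(1/\delta)}{N} } + L C C_1 \sqrt{ \frac{ d_{\mathcal{M}} \log\big( \frac{\mathrm{diam}(\mathcal{M}) \sqrt{d}}{\delta} \big) + \log(q + 1) }{m} } \Big( 1 + \sqrt{2 \log\big(\tfrac{1}{\delta}\big)} \Big). \]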

□